# Cross-modal Retrieval
### UniME LLaVA OneVision 7B
Publisher: DeepGlint-AI · License: MIT · Task: Multimodal Alignment · Tags: Transformers, English · Downloads: 376 · Likes: 2

UniME is a general-purpose embedding learning framework built on multimodal large language models; it substantially improves multimodal embeddings through textual discriminative knowledge distillation and hard-negative-enhanced instruction tuning.

### UniME LLaVA 1.6 7B
Publisher: DeepGlint-AI · License: MIT · Task: Image-to-Text · Tags: Transformers, English · Downloads: 188 · Likes: 3

UniME is a general embedding learning model built on a multimodal large language model, trained at 336×336 image resolution; it ranked first on the MMEB leaderboard.

### OmniEmbed V0.1
Publisher: Tevatron · License: MIT · Task: Multimodal Fusion · Downloads: 2,190 · Likes: 3

A multimodal embedding model based on Qwen2.5-Omni-7B that produces unified embedding representations for cross-lingual text, images, audio, and video.

### mmE5 MLlama 11B Instruct
Publisher: intfloat · License: MIT · Task: Multimodal Fusion · Tags: Transformers, Supports multiple languages · Downloads: 596 · Likes: 18

mmE5 is a multimodal multilingual embedding model trained from Llama-3.2-11B-Vision; it improves embedding quality through high-quality synthetic data and achieves state-of-the-art results on the MMEB benchmark.

### ConceptCLIP
Publisher: JerrryNie · License: MIT · Task: Image-to-Text · Tags: Transformers, English · Downloads: 836 · Likes: 1

ConceptCLIP is a large-scale vision-language pretraining model enhanced with medical concepts, delivering robust performance across a wide range of medical imaging modalities and tasks.

### MEXMA-SigLIP
Publisher: visheratin · License: MIT · Task: Text-to-Image · Tags: Supports multiple languages · Downloads: 137 · Likes: 3

MEXMA-SigLIP is a high-performance CLIP-style model that pairs a multilingual text encoder with an image encoder, supporting 80 languages.

### LLM2CLIP Llama 3 8B Instruct CC Finetuned
Publisher: microsoft · License: Apache-2.0 · Task: Multimodal Fusion · Downloads: 18.16k · Likes: 35

LLM2CLIP is an approach that uses large language models to extend CLIP's cross-modal capabilities, significantly improving the discriminative power of visual and text representations.

### RS-M-CLIP
Publisher: joaodaniel · License: MIT · Task: Image-to-Text · Tags: Supports multiple languages · Downloads: 248 · Likes: 1

A multilingual vision-language pretrained model for remote sensing, supporting cross-modal image-text tasks in 10 languages.

### Video LLaVA
Publisher: AnasMohamed · Task: Text-to-Image · Downloads: 194 · Likes: 0

A large-scale vision-language model based on the Vision Transformer architecture, supporting cross-modal understanding between images and text.

### Nomic Embed Vision V1.5
Publisher: nomic-ai · License: Apache-2.0 · Task: Text-to-Image · Tags: Transformers, English · Downloads: 27.85k · Likes: 161

A high-performance vision embedding model that shares its embedding space with nomic-embed-text-v1.5, enabling multimodal applications.

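Because the vision and text models share one embedding space, image-text retrieval reduces to embedding each side and comparing vectors. The sketch below follows the usage pattern on the nomic-ai model cards, but the pooling choices and the `search_query:` prefix are assumptions to double-check against the official documentation.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

# Vision tower: embeds images into the shared Nomic embedding space.
processor = AutoImageProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1.5")
vision_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-vision-v1.5", trust_remote_code=True)

# Text tower: embeds queries into the same space.
tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1.5")
text_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

image = Image.open("photo.jpg")  # any local image
img_inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    img_hidden = vision_model(**img_inputs).last_hidden_state
img_emb = F.normalize(img_hidden[:, 0], p=2, dim=-1)  # CLS pooling (assumed)

# "search_query: " prefix assumed from the nomic-embed-text usage notes.
txt_inputs = tokenizer(["search_query: a photo of a dog"], padding=True, return_tensors="pt")
with torch.no_grad():
    txt_hidden = text_model(**txt_inputs).last_hidden_state
txt_emb = F.normalize(txt_hidden.mean(dim=1), p=2, dim=-1)  # simplified mean pooling (assumed)

print((img_emb @ txt_emb.T).item())  # cosine similarity between image and query
```
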
### Nomic Embed Vision V1
Publisher: nomic-ai · License: Apache-2.0 · Task: Text-to-Image · Tags: Transformers, English · Downloads: 2,032 · Likes: 22

A high-performance vision embedding model that shares its embedding space with nomic-embed-text-v1, enabling multimodal applications.

### CLIP ViT B 32 Vision
Publisher: Qdrant · License: MIT · Task: Image Classification · Tags: Transformers · Downloads: 10.01k · Likes: 7

An ONNX port of the vision encoder from the CLIP ViT-B/32 architecture, suitable for image classification and similarity search.

### M3D-CLIP
Publisher: GoodBaiBai88 · License: Apache-2.0 · Task: Multimodal Alignment · Tags: Transformers · Downloads: 2,962 · Likes: 9

M3D-CLIP is a CLIP model designed for 3D medical imaging, aligning vision and language through a contrastive loss.

### BLaIR RoBERTa Base
Publisher: hyp1231 · License: MIT · Task: Text Embedding · Tags: Transformers, English · Downloads: 415 · Likes: 3

BLaIR is a language model pretrained on the Amazon Reviews 2023 dataset for recommendation and retrieval scenarios; it produces strong product text representations and can predict related products.

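A minimal sketch of producing product-text embeddings with BLaIR through the standard Transformers encoder API. The checkpoint id is inferred from the catalog entry, and CLS pooling with L2 normalization is an assumption; check the model card for the official recipe.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# BLaIR is a RoBERTa-based encoder, so standard AutoModel loading applies.
tokenizer = AutoTokenizer.from_pretrained("hyp1231/blair-roberta-base")
model = AutoModel.from_pretrained("hyp1231/blair-roberta-base")

texts = [
    "wireless noise-cancelling headphones with 30-hour battery",  # product description
    "headphones good for long flights",                           # user query / review context
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state

# CLS pooling + L2 normalization (assumed recipe).
emb = F.normalize(hidden[:, 0], p=2, dim=-1)
print(float(emb[0] @ emb[1]))  # cosine similarity between product and query
```
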
### OWLv2 Base Patch16
Publisher: Xenova · Task: Object Detection · Tags: Transformers · Downloads: 17 · Likes: 0

OWLv2 is a vision-language pretrained model focused on object detection and localization.

### InternVL 14B 224px
Publisher: OpenGVLab · License: MIT · Task: Text-to-Image · Tags: Transformers · Downloads: 521 · Likes: 37

InternVL-14B-224px is a 14B-parameter vision-language foundation model supporting a wide range of vision-language tasks.

### LanguageBind Video Huge V1.5 FT
Publisher: LanguageBind · License: MIT · Task: Multimodal Alignment · Tags: Transformers · Downloads: 2,711 · Likes: 4

LanguageBind is a pretrained model that achieves multimodal semantic alignment through language, binding modalities such as video, audio, depth, and thermal imaging to language for cross-modal understanding and retrieval.

### ViLT Finetuned 200
Publisher: Atul8827 · License: Apache-2.0 · Task: Text-to-Image · Tags: Transformers · Downloads: 35 · Likes: 0

A vision-language model based on the ViLT architecture, fine-tuned for a specific downstream task.

### LanguageBind Audio FT
Publisher: LanguageBind · License: MIT · Task: Multimodal Alignment · Tags: Transformers · Downloads: 12.59k · Likes: 1

LanguageBind is a language-centric multimodal pretraining method that achieves semantic alignment by using language as the bridge between modalities.

### LanguageBind Video Merge
Publisher: LanguageBind · License: MIT · Task: Multimodal Alignment · Tags: Transformers · Downloads: 10.96k · Likes: 4

LanguageBind is a multimodal model that extends video-language pretraining to N modalities through language-based semantic alignment; the work was accepted at ICLR 2024.

### MetaCLIP B16 FullCC2.5B
Publisher: facebook · Task: Text-to-Image · Tags: Transformers · Downloads: 90.78k · Likes: 9

MetaCLIP applies the CLIP framework to CommonCrawl data with the goal of revealing CLIP's training-data curation method.

### MetaCLIP B32 400M
Publisher: facebook · Task: Text-to-Image · Tags: Transformers · Downloads: 135.37k · Likes: 41

The MetaCLIP base model is a vision-language model trained on CommonCrawl data to build a shared image-text embedding space.

### LanguageBind Image
Publisher: LanguageBind · License: MIT · Task: Multimodal Alignment · Tags: Transformers · Downloads: 25.71k · Likes: 11

LanguageBind is a language-centric multimodal pretraining method that uses language as the bridge between modalities to achieve semantic alignment.

### LanguageBind Depth
Publisher: LanguageBind · License: MIT · Task: Multimodal Alignment · Tags: Transformers · Downloads: 898 · Likes: 0

LanguageBind is a language-centric multimodal pretraining method that uses language as the bridge between modalities to achieve semantic alignment across video, infrared, depth, audio, and other modalities.

### LanguageBind Video
Publisher: LanguageBind · License: MIT · Task: Multimodal Alignment · Tags: Transformers · Downloads: 166 · Likes: 2

LanguageBind is a multimodal pretraining framework that extends video-language pretraining to N modalities through language-based semantic alignment; the work was accepted at ICLR 2024.

### LanguageBind Thermal
Publisher: LanguageBind · License: MIT · Task: Multimodal Alignment · Tags: Transformers · Downloads: 887 · Likes: 1

LanguageBind is a pretraining framework that achieves multimodal semantic alignment with language as the bridge, supporting joint learning of modalities such as video, infrared, depth, and audio together with language.

### FLIP Base 32
Publisher: FLIP-dataset · License: Apache-2.0 · Task: Multimodal Fusion · Tags: Transformers · Downloads: 16 · Likes: 0

A vision-language model based on the CLIP architecture, post-trained on 80 million face images.

### CLIP Giga Config Fixed
Publisher: Geonmo · License: MIT · Task: Text-to-Image · Tags: Transformers · Downloads: 109 · Likes: 1

A large CLIP model trained on the LAION-2B dataset with the ViT-bigG-14 architecture, supporting cross-modal understanding between images and text.

### CLIP ViT Base Patch32
Publisher: Xenova · Task: Text-to-Image · Tags: Transformers · Downloads: 177.13k · Likes: 8

The CLIP model developed by OpenAI, based on the Vision Transformer architecture and supporting joint understanding of images and text.

### CLIP ViT Base Patch16
Publisher: Xenova · Task: Text-to-Image · Tags: Transformers · Downloads: 32.99k · Likes: 9

OpenAI's open-source CLIP model, based on the Vision Transformer architecture and supporting cross-modal understanding of images and text.

### CLIP ViT L 14 CommonPool.XL.laion S13b B90k
Publisher: laion · License: MIT · Task: Text-to-Image · Downloads: 176 · Likes: 1

A vision-language model based on the CLIP architecture, trained on LAION data and supporting zero-shot image classification.

### CLIP ViT B 16 DataComp.L S1b B8k
Publisher: laion · License: MIT · Task: Text-to-Image · Downloads: 1,166 · Likes: 1

A zero-shot image classification model based on the CLIP architecture, trained on the DataComp dataset and supporting efficient image-text matching.

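Checkpoints published under the laion and timm namespaces in this list are typically loaded with the OpenCLIP library directly from the Hugging Face Hub. The sketch below assumes the hub id mirrors the catalog name (`laion/CLIP-ViT-B-16-DataComp.L-s1B-b8K`); verify the exact id on the model card before use.

```python
import torch
import open_clip
from PIL import Image

# Load an OpenCLIP checkpoint straight from the Hugging Face Hub
# (hub id assumed to match the catalog entry; check the model card).
hub_id = "hf-hub:laion/CLIP-ViT-B-16-DataComp.L-s1B-b8K"
model, _, preprocess = open_clip.create_model_and_transforms(hub_id)
tokenizer = open_clip.get_tokenizer(hub_id)
model.eval()

image = preprocess(Image.open("photo.jpg")).unsqueeze(0)
text = tokenizer(["a dog", "a cat", "a satellite image"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # zero-shot class probabilities for the single image
```
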
### CLIP ViT B 16 CommonPool.L.text S1b B8k
Publisher: laion · License: MIT · Task: Text-to-Image · Downloads: 58 · Likes: 0

A vision-language model based on the CLIP architecture, supporting zero-shot image classification.

### EVA Giant Patch14 Plus CLIP 224.merged2b S11b B114k
Publisher: timm · License: MIT · Task: Text-to-Image · Downloads: 1,080 · Likes: 1

EVA-Giant is a large-scale vision-language model based on the CLIP architecture, supporting zero-shot image classification.

### EVA02 Large Patch14 CLIP 336.merged2b S6b B61k
Publisher: timm · License: MIT · Task: Text-to-Image · Downloads: 15.78k · Likes: 0

EVA02 is a large-scale vision-language model based on the CLIP architecture, supporting zero-shot image classification.

### X-CLIP Large Patch14 16 Frames
Publisher: microsoft · License: MIT · Task: Text-to-Video · Tags: Transformers, English · Downloads: 678 · Likes: 3

X-CLIP extends CLIP to general video-language understanding, using contrastive learning to support video classification and video-text retrieval.

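A rough sketch of video-text matching with this checkpoint through the Transformers `XCLIPModel` and `XCLIPProcessor` classes. The checkpoint id is inferred from the catalog entry, and the random frames are stand-ins; replace them with 16 real decoded frames (e.g. via decord or PyAV).

```python
import numpy as np
import torch
from transformers import XCLIPModel, XCLIPProcessor

model_id = "microsoft/xclip-large-patch14-16-frames"  # id inferred from the catalog entry
processor = XCLIPProcessor.from_pretrained(model_id)
model = XCLIPModel.from_pretrained(model_id)

# Stand-in for real video decoding: this variant expects 16 RGB frames per clip.
frames = [np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8) for _ in range(16)]

texts = ["a person playing basketball", "a cat sleeping on a sofa"]
inputs = processor(text=texts, videos=frames, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Similarity of the clip to each text query; softmax over the text candidates.
probs = outputs.logits_per_video.softmax(dim=-1)
print(probs)
```
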
### CLIP ViT Large Patch14 336
Publisher: openai · Task: Text-to-Image · Tags: Transformers · Downloads: 5.9M · Likes: 241

A large-scale vision-language pretrained model based on the Vision Transformer architecture, supporting cross-modal understanding between images and text.

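As a minimal illustration of the image-text matching these CLIP checkpoints provide, the sketch below scores one image against a few captions with the Transformers `CLIPModel` and `CLIPProcessor` API, using the `openai/clip-vit-large-patch14-336` checkpoint listed above.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-large-patch14-336"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("photo.jpg")  # any local image
captions = ["a photo of a dog", "a photo of a cat", "a diagram of a network"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: similarity of the image to each caption; softmax gives zero-shot probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```
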
### Mengzi Oscar Base
Publisher: Langboat · License: Apache-2.0 · Task: Image-to-Text · Tags: Transformers, Chinese · Downloads: 20 · Likes: 5

A Chinese multimodal pretraining model built on the Oscar framework, initialized from the Mengzi-BERT base model and trained on 3.7 million image-text pairs.

### M BERT Base ViT B
Publisher: M-CLIP · Task: Multimodal Alignment · Downloads: 3,376 · Likes: 12

A multilingual CLIP text encoder fine-tuned from BERT-base-multilingual and aligned with the CLIP visual encoder, covering 69 languages.